System and methods for content and contention-aware approximate object detection

ABSTRACT

System and methods for content- and contention-aware object detection are provided. A system may receive video information and perform object detection and object tracking based on an execution configuration. The system may approximate an optimized execution configuration. To approximate the optimized execution configuration, the system may identify, based on the video information, a plurality of content features. The system may further measure a contention level of one or more computer resources. The system may approximate, based on the content features and the contention level, latency metrics for a plurality of execution configuration sets, respectively. The system may also approximate, based on the content features, accuracy metrics for the execution configuration sets, respectively. The system may select the optimized execution configuration set in response to satisfaction of a performance criterion. The system may perform object detection and object tracking based on the optimized execution configuration set.

CROSS REFERENCE TO RELATED APPLICATIONS

This application claims the benefit of U.S. Provisional Application No. 63/168,393, filed Mar. 31, 2021, which is herein incorporated by reference in its entirety.

GOVERNMENT RIGHTS

This invention was made with government support under CCF 1919197 awarded by the National Science Foundation. The government has certain rights in the invention.

TECHNICAL FIELD

This disclosure relates to computer vision and, in particular, to machine learning and resource contention management.

STATEMENT REGARDING PRIOR DISCLOSURES BY THE INVENTORS OR JOINT INVENTORS UNDER 37 C.F.R. 1.77(B)(6)

The joint inventors of the present disclosure, Somali Chaterji, Saurabh Bagchi, and Ran Xu, publicly disclosed information related to the present disclosure in the article "ApproxDet: content and contention-aware approximate object detection for mobiles," in Proceedings of the 18th Conference on Embedded Networked Sensor Systems, pp. 449-462, 2020 (appeared in ACM-SenSys 2020). The article was published on Nov. 16, 2020, which is less than one year prior to the filing date of U.S. Provisional Application Ser. No. 63/168,393, filed Mar. 31, 2021. A copy of the article will be provided in an Information Disclosure Statement (IDS).

BACKGROUND

Mobile devices with integrated cameras have seen tremendous success in various domains. Equipped with increasingly powerful System-on-Chips (SoCs), mobile augmented reality (AR) devices such as the Microsoft Hololens and Magic Leap One, along with next generation mobile devices, are opening up a plethora of new continuous mobile vision applications that were previously deemed impossible. These applications range from detection of objects around the environment for an immersive experience in AR games such as Pokemon-Go, to recognition of road signs for providing directions in real-time, to identification of people for interactive photo editing, and to Manchester City's AR-driven stadium tour. A fundamental vision task that all of these applications must perform is object detection on the live video stream that the camera is capturing. To maintain the immersive experience of the user (e.g., for AR games) or to give usable output on time (e.g., for road sign recognition), such tasks should be performed in near real-time with very low latency.

BRIEF DESCRIPTION OF THE DRAWINGS

The embodiments may be better understood with reference to the following drawings and description. The components in the figures are not necessarily to scale. Moreover, in the figures, like-referenced numerals designate corresponding parts throughout the different views.

FIG. 1 illustrates a first example of a system.

FIG. 2 illustrates an example of a multi-branch object detector.

FIG. 3 illustrates an example of logic for a scheduler of a multi-branch object detector.

FIG. 4 illustrates an example of logic for a multi-branch object detector.

FIG. 5 illustrates a second example of a system.

DETAILED DESCRIPTION

Computer vision and computer systems research working together have made significant progress in lightweight object detection applicable to mobile settings for still images in recent years, thanks to the development of efficient deep neural networks (DNNs). However, directly applying image-based object detectors to video streams suffers, especially in mobile settings. First, applying a detector on all video frames introduces excessive computational cost and would often violate the latency requirements of our target continuous vision applications. Second, image-based object detectors are not cognizant of the significant temporal continuity that exists in successive video frames (e.g., a static scene with a slowly moving object), and are unable to map this continuity to the latency budget. To overcome these algorithmic challenges, the computer vision community has proposed some DNN models for video object detection and tracking.

Despite these efforts, challenges of video object detection (both 2D and 3D) for continuous vision applications on resource-constrained devices remain largely unsolved. A major shortcoming is that none of the existing approaches can adapt to runtime condition changes, such as the content characteristics of the input videos and the level of contention on the edge device. Modern mobile devices come with increasingly powerful System-on-Chips (SoCs) having multiple heterogeneous processing units, and no longer process just a single application at a time. For example, both iOS and Android support multiple background tasks, such as an always-on personal assistant, e.g., Siri running a DNN for speech recognition (GPU contention), or a firewall constantly inspecting packets (memory-bandwidth contention). These tasks can run simultaneously with a continuous vision application that requires a video object detector, leading to unpredictable resource contention on mobile devices similar to a traditional server setting.

Such concurrent applications or background tasks can compete with object detection, drastically increasing the object detector's latency. Consider the example of a widely used DNN-based object detector: Faster R-CNN (FRCNN), integrated with MedianFlow (MF) object tracking and optimized for a latency requirement of 100 milliseconds (ms). Without contention, the detector has a latency of ≈64 ms. However, as the GPU contention level increases, a drastic increase in detection latency follows. While the accuracy remains the same, the latency of the detector fluctuates significantly and violates the latency requirement. Different from server-class devices, mobile devices may have limited ability to isolate co-located applications from interference, stemming from the paucity of VM-like isolation mechanisms.

To address these and other technical issues, the system and methods described herein are provided. In various examples, the system takes both video content-awareness and resource contention-awareness within its ambit. In contrast to the static FRCNN+MF, the system manages to keep the latency below the requirement as the level of contention increases, while achieving better accuracy. To this end, the system may use a model with multiple approximation parameters that are dynamically tuned at runtime to stay on the Pareto optimal frontier (of the latency-accuracy curve in this case). We refer to the execution branch with a particular configuration set as an approximation branch (AB).

The system and methods described herein offer various technical advantages. First, the systems and methods herein model the impact of the contention level on the latency of the ABs. Second, the system and methods described herein combine an offline-trained latency prediction model and an online contention sensor to precisely predict the latency of each AB in our system. Thus, the system and methods described herein can adapt to resource contention at a given latency budget at runtime, an ability especially critical for deployment on edge devices as their resources are limited and shared. Third, the system and methods described herein further consider how the video content influences both accuracy and latency. The system and methods described herein leverage video characteristics, such as the object motion (fast vs. slow) and the sizes and the number of objects, to better predict the accuracy and latency of the ABs, and to select the best AB with reduced latency and increased accuracy. Additional benefits, efficiencies, and improvements over existing market solutions are made evident in the systems and methods described herein.

FIG. 1 illustrates a first example of a system 100. The system 100 may include an adaptive object detection framework (AODF) 101. The AODF may include a scheduler 104 and a multi-branch object detection framework 102. The multi-branch detector 102 may receive video information, such as a video frame, and an execution configuration as an input. The scheduler 104 may govern which configuration the detection framework should utilize to perform the detection.

The multi-branch detector 102 may include an object detector 106 and an object tracker 108, which allows both object tracking and detection. This follows the practice for video object detection that combines the heavy-weight detection and the light-weight tracker. The multi-branch detector may receive configuration parameters which govern operation of the multi-branch detector and associated object detection and object tracking, whether 2D or 3D. Thus, the configuration parameters may be regarded as tuning parameters which can be modified to adjust performance of object tracking/detection. The system and methods described herein can also be applied to object classification, which is a simpler computer vision task than object detection.

The scheduler 104 may tune the execution configuration of the multi-branch detector based on the features derived from the video and on computer resource contention. The execution configuration may be tunable within a dynamic range. Accordingly, a multi-dimensional configuration space can be created, resulting in multiple possible ABs. The accuracy and the latency (execution time) are different for each AB, and the values depend upon the video content characteristics (e.g., still versus fast-moving) and the compute resources available (e.g., lightly-loaded versus heavily-loaded mobile). To efficiently select an AB at runtime according to the given (and possibly changing) user requirement, the scheduler estimates the current latency and accuracy of each branch. The scheduler then selects the most accurate/fastest branch according to the specified performance criterion.

The scheduler may run occasionally, periodically, or according to a rule to re-calibrate the best approximation branch and determine the respective configuration for the multi-branch detector. In some examples, the scheduler may establish a new approximation branch (and thus a new configuration) based on a learnable interval called the "scheduler interval". The scheduler interval may be a value which triggers the scheduler to run. In some examples, the scheduler interval can be a time or a number of frames for which the configuration of the detection framework is maintained.

Multi-Branch Object Detection Framework

FIG. 2 illustrates an example of a multi-branch object detector 102. The object detector may perform object detection based on an object detection model. For example, the object detection model may include a deep neural network (DNN). Various non-limiting examples of DNNs for object detection are described below.

Given an input image or video frame, an object detector aims at locating tight bounding boxes of object instances from target categories. In terms of network architecture, a CNN-based object detector can be divided into the backbone part that extracts image features, and the detection part that classifies object regions based on the extracted features. The detection part can be further divided into two-stage and single-stage detectors. Two-stage detectors usually make use of Region Proposal Networks (RPN) for generating regions-of-interest (RoIs), which are further refined through the detection head and are thus more accurate.

The overwhelming majority of work on lightweight object detection is for images, e.g., YOLOv3 and SSD, and is thus agnostic to video characteristics inherent to the temporal relation between image frames. In some preferred examples, the detection DNN may include Faster-RCNN with ResNet-50 as the backbone. Faster-RCNN is an accurate and flexible framework for object detection and a canonical example of a two-stage object detector. An input image or video frame is first resized to a specific input shape and fed into a DNN, where image features are extracted. Based on the features, a RPN identifies a pre-defined number of candidate object regions, known as region proposals. Image features are further aggregated within the proposed regions, followed by another DNN to classify the proposals into either background or one of the target object categories and to refine the locations of the proposals. Our key observation is that the input shape and the number of proposals have significant impact on the accuracy and latency. Therefore, we propose to expose the input shape and the number of region proposals as tuning parameters.

Alternatively or in addition, the object detector may perform single-stage object detection. Without using region proposals, these models are optimized for efficiency and are oftentimes less flexible. Examples of single-stage object detection may include YOLO. Single-stage object detection may simplify object detection as a regression problem by directly predicting bounding boxes and class probabilities without the generation of region proposals.

Object tracking is the other aspect of the multi-branch detector. The object tracker 108 may locate moving objects over time within a video. The object tracker, as described herein, may focus on motion-based visual tracking due to its simplicity and efficiency. In some examples, the object tracker may assume the initial position of each object is given in a starting frame, and makes use of local motion cues to predict the object's position in the next batch of frames.

The object tracker may access one or more object tracking frameworks 204 which perform object tracking with various degrees of accuracy and efficiency with a given set of input data. The object tracking frameworks 204 may include model(s) and/or logic for performing object tracking. For example, the object tracking frameworks may include a set of existing motion-based object trackers, such as MedianFlow, KCF, CSRT, Dense Optical Flow, and/or any other suitable trackers. A key difference among various object trackers lies in the extraction of motion cues, via, e.g., optical flow or correlation filters, leading to varying accuracy and efficiency under different application scenarios. Accordingly, the multi-branch object detector may enable the adaptive choice of the trackers as one of the tuning variables described herein.

Another important factor of object tracking performance is the input resolution to a motion-based tracker. A down-sampled version of the input image improves the capture of large motion and thus the tracking of fast-moving objects, while a high-resolution input image facilitates the accurate tracking of objects that move slowly. Therefore, the multi-branch object detector 102 may receive the down-sampling ratio of the input image as another tuning parameter for tracking.

Accordingly, to support the runtime adaptive object detection framework on videos, the multi-branch object detector 102 may operate with light switching overheads among branches for mapping to runtime changes. Different from object detection on still images, videos have temporal similarities, and an object tracker is used to reduce the runtime cost with minor accuracy drop.

The object detector may perform object detection in a sampling interval while the tracker may track objects between successive frames in the sampling interval. In other words, the object detector may perform computer vision tasks such as object classification, object localization, object detection (in some ways, together these three are within the ambit of object recognition), activity recognition, etc. Essentially, object detection does object classification and then also, in some examples, may define a bounding box around each object of interest in the image and then assigns a class label to each object with a certain probability. Alternatively or in addition, the object detector may perform vanilla object detection and video object detection. An advantage afforded by the system described is that one can leverage the temporal continuity of frames in a group-of-frames (GoF) within a time window in a continuous video and remove redundant steps. For example, some frames may be repetitive, and detection may be suspended and, instead, only lightweight tracking may be performed. In fact, this window is something we can learn from the characteristics of the video, or it may include a fixed window, such as 8 frames. Accordingly, the system may perform compute-intensive object detection for the first frame and object "tracking" (essentially following the detected objects) for the rest of the window (i.e., 7 frames). This is essentially the sampling interval (si) tuning parameter in our algorithm, also listed in Table 1 below.

Non-limiting examples of the tuning parameters described herein include those listed in Table 1, though other parameters are possible.

TABLE 1
Tuning Parameter Examples

Tuning Parameter               Summary Description
Sampling interval (si)         For every si frames, the heavy-weight object detection DNN runs on the first frame and the light-weight object tracker runs on the rest of the frames.
Input shape (shape)            The resized shape of the video frame that is fed into the detection DNN.
Number of proposals (nprop)    The number of proposals generated from the Region Proposal Networks (RPN) in our detection DNN.
Tracker type (tracker)         Type or identifier of the object tracker.
Down-sampling ratio (ds)       The down-sampling ratio of the frame used by the object tracker.

Generally, it was empirically observed through experimentation that a smaller si, a larger shape, more nprop, and a smaller ds will raise the accuracy, and vice versa.
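
For illustration only, the following is a minimal sketch, in Python, of how the sampling interval si drives the alternation between the heavy-weight detector and the light-weight tracker. The detector and tracker objects and their method names are hypothetical placeholders, not the actual implementation described herein.

    # Minimal sketch of the si-driven detect-then-track loop.
    # `detector`, `tracker`, and `config` are hypothetical placeholders.
    def process_stream(frames, detector, tracker, config):
        boxes = []
        for i, frame in enumerate(frames):
            if i % config["si"] == 0:
                # Heavy-weight detection on the first frame of each window.
                boxes = detector.detect(frame, shape=config["shape"],
                                        nprop=config["nprop"])
                tracker.init(frame, boxes, ds=config["ds"])
            else:
                # Light-weight tracking on the remaining (si - 1) frames.
                boxes = tracker.update(frame)
            yield boxes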

Scheduler

Referring back to FIG. 1, a deeper discussion of the scheduler follows. The scheduler 104 may perform the decision-making at runtime on which AB (aka execution configuration set) should be used to run the inference on the input video frames. Formally, the scheduler 104 maximizes the estimated detection accuracy of the system given a latency requirement L_(req). This is done by identifying a feasible set of branches that satisfy the target latency requirement, and choosing the most accurate branch. In case of an empty feasible set, the fastest branch is returned. Thus, we formulate the optimal AB b_(opt) as follows,

$b_{opt} = \begin{cases} \operatorname{argmax}_{b \in \hat{B}}\left( A_{b} \right), & \text{if}\ \hat{B} \neq \varnothing, \\ \operatorname{argmin}_{b \in \hat{\beta}}\left( L_{est,b} \right), & \text{otherwise} \end{cases}$

where β̂ is the set of all ABs considered, B̂ is the feasible set, i.e., B̂ = {b ∈ β̂ : L_(est,b) < L_(req)}, and A_(b) and L_(est,b) are the estimated accuracy and latency of the AB, respectively. The search space β̂, which includes five orthogonal knobs, has millions of states.

To further reduce the scheduler overhead and enhance our system robustness, the scheduler may make a decision every sw frames. The motivation for introducing sw is to prevent the scheduler from making very frequent decisions. When sw = max(8, si), the scheduler makes a decision no more often than every 8 frames. When the scheduler chooses a branch with a long si, it will make a following decision every si frames. In addition to the latency of the detection and tracking kernels, the switching overhead L_(sw) and the scheduler overhead L_(sc) may be included in the overall latency estimation of an AB b, i.e., L_(est,b) = L_(b,fr) + (L_(sw) + L_(sc))/sw. The light-weight online feature extractors may be designed so that they can adapt seamlessly to content and contention changes.
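
A minimal sketch of this selection rule, including the amortized overhead term above, may take the following form in Python; the branch representation and field names are assumptions for illustration rather than the actual implementation.

    # Sketch of the scheduler's branch selection; each branch carries the
    # predicted per-frame latency L_(b,fr) and accuracy A_b (field names
    # are hypothetical).
    def select_branch(branches, l_req, l_sw, l_sc, sw):
        def l_est(b):
            # L_(est,b) = L_(b,fr) + (L_(sw) + L_(sc)) / sw
            return b["latency_fr"] + (l_sw + l_sc) / sw
        feasible = [b for b in branches if l_est(b) < l_req]
        if feasible:
            # Most accurate branch within the latency budget.
            return max(feasible, key=lambda b: b["accuracy"])
        # Empty feasible set: fall back to the fastest branch.
        return min(branches, key=l_est)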

The scheduler may include a content-aware feature extractor 114 and a contention sensor 116. The content-aware feature extractor 114 may extract features from the video information. The features may include, for example, the height and width of the current frame and object information tracked from the last frame(s), and the extractor calculates the object movements over the past few frames. The contention sensor 116 may detect resource contention level(s). An accuracy model and a latency model may be trained offline to support such estimation during runtime based on the contention level and/or content features.

Configuration of the Tuning Parameters

As previously discussed, the tuning parameters may include the sampling interval (si), the input image size (shape) to the detection DNN, the number of proposals (nprop) in the detection DNN, the type of object tracker (tracker), and the down-sampling ratio of the input to the tracker (ds). We now describe the implementation details of these parameters, including example data types and example value ranges.

Sampling Interval (si). si defines the interval of running the object detector. The object tracker runs on the following (si−1) frames. For example, our system runs object detection on every frame when si=1. To reduce the search space of si, we constrain si to a preset set—{1, 2, 4, 8, 20, 50, 100}. These pre-defined si values are chosen empirically to cover common video object detection scenarios. With the max value of si=100, the detector runs at a large interval of 3-4 seconds and the tracker runs in-between.

Input Video Frame Shape to Detector (shape). The shape defines the shortest side of the input video frame to the object detector. The value of shape must be a multiple of 16 to make the precise alignment of the image pixels and the feature map. We set the shape range from 224 to 576, since a shape smaller than 224 significantly reduces the accuracy, and a shape larger than 576 results in a heavy computational burden and does not improve the accuracy based on results on the validation set.

Number of Proposals (nprop). The nprop controls the number of candidate regions considered for classification in the object detector. We limit the value of nprop (an integer) between 1 and 100. With nprop=1, only the top-ranked proposal from the RPN is used for detection. Increasing nprop will boost the detector's performance, yet with increased computational cost and runtime.

Type of Trackers (tracker). The tracker defines which tracker to use from the MedianFlow, KCF, CSRT, and dense optical flow trackers. These trackers are selected based on their efficiency and accuracy. Different trackers have varying performance under different scenarios. For example, the CSRT tracker is the most accurate among these trackers, but is also the most time consuming. The MedianFlow tracker is fast and accurate when an object moves slowly in the video, yet has poor performance for a fast-moving object. We use the implementation from OpenCV for all trackers.

Downsampling Ratio for the Tracker (ds). The ds controls the input image size to the tracker. The value of ds is limited to 1, 2, and 4, i.e., no downsampling, downsampling by a factor of 2, and downsampling by a factor of 4, respectively. A larger ds reduces the computational cost and favors the tracking of fast-moving objects. A smaller ds increases the latency, yet provides more accurate tracking of slowly moving objects.
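
Taken together, the five knobs define the multi-dimensional space of approximation branches. The following sketch enumerates a coarsely sampled version of that space using the value ranges above; the particular sampling of nprop is an assumption for illustration, since the full space has millions of states.

    from itertools import product

    # Knob ranges from the text; nprop is coarsely sampled here.
    SI = [1, 2, 4, 8, 20, 50, 100]
    SHAPE = list(range(224, 577, 16))      # multiples of 16 from 224 to 576
    NPROP = [1, 3, 5, 10, 20, 50, 100]     # sampled from the 1..100 range
    TRACKER = ["medianflow", "kcf", "csrt", "dense_optical_flow"]
    DS = [1, 2, 4]

    branches = [
        {"si": si, "shape": sh, "nprop": n, "tracker": t, "ds": ds}
        for si, sh, n, t, ds in product(SI, SHAPE, NPROP, TRACKER, DS)
    ]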

Content Feature Extraction

To start with, content features have a great impact on both the accuracy and latency of each AB based on the following observations—(1) tracker latency is affected by the number and area of the objects, because tracker algorithms take the bounding boxes of the detection frames as inputs and calculate features inside each box; (2) both detection and tracker accuracy are affected by the content in the video. For example, detection DNNs, including Faster-RCNN, SSD, and YOLO, perform consistently poorly with small objects on the MS COCO dataset. Moreover, both the detection DNN and the tracker find it harder to deal with fast-moving objects. Some previous works mention that movement between frames can be used as a feature to trigger the heavy detection process. This implies that for video object detection systems, we need to extract these content features to improve the accuracy and latency of our models. The following discussion considers two types of content features.

Object Basic Features. Object basic features may include the number of objects and the summed area of the objects. These features may be used for modeling the tracker latency. The intuition is that some light-weight trackers' latency increases proportionally with the number of objects and the area of the objects, since each object is tracked independently, and the larger the area, the more tracking-related features need computation. It was empirically verified, through experimentation according to various examples, that the latency of the object trackers is affected by both the number and sizes of the objects. In some experimentation, 10% of the ImageNet video object detection (VID) training dataset was used to generate the latency data samples, though more or fewer samples may be used in practice, depending on the implementation.
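
As an illustration, the object basic features can be computed directly from the bounding boxes of the most recent detection frame. The following is a minimal sketch, assuming boxes in (x1, y1, x2, y2) form.

    def object_basic_features(boxes):
        """n_obj, summed area, and average size from (x1, y1, x2, y2) boxes."""
        n_obj = len(boxes)
        areas = [(x2 - x1) * (y2 - y1) for x1, y1, x2, y2 in boxes]
        total_area = sum(areas)
        avg_size = total_area / n_obj if n_obj else 0.0
        return n_obj, total_area, avg_size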

Object Movement Features. The recent movement of objects may be used as a feature for modeling the framework accuracy. The features may be expressed as a measurement of distance. More rigorously, the movement may be defined as the Euclidean distance of the objects' centers. In some examples, the content feature extractor may take the mean movement of all, or a large subset of, the objects in the recent frames. The intuition is that the faster the objects move in the video frame, the lower the accuracy, especially for the execution branches with a higher sampling interval. Experimental results, according to various examples, show that the accuracy of high-si branches (si=100) does not drop significantly (≈10%) on slow-moving videos but reduces considerably (>30%) on fast-moving videos.
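
A sketch of the movement feature under these definitions follows, assuming per-frame lists of object centers already matched by object identity (the matching itself is outside this sketch).

    import math

    def movement_feature(prev_centers, curr_centers):
        """Mean Euclidean distance between matched (x, y) object centers
        in two recent frames."""
        if not prev_centers or not curr_centers:
            return 0.0
        dists = [math.hypot(cx - px, cy - py)
                 for (px, py), (cx, cy) in zip(prev_centers, curr_centers)]
        return sum(dists) / len(dists)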

Latency Modeling

The latency model may aim to predict the frame-wise latency of each AB for future frames. L_(fr) may be denoted as the per-frame latency of our adaptive object detection framework. L_(fr) is a function of the DNN-based detection latency L_(DNN) and the tracking latency L_(tracker). If the object detection DNN runs every si frames (sampling interval), the latency L_(fr) is given by

$L_{fr} = \frac{L_{DNN}}{si} + L_{tracker}.$
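
As a worked example under assumed values, with si = 8, L_(DNN) = 64 ms (consistent with the uncontended detector latency noted above), and a hypothetical L_(tracker) = 5 ms, the per-frame latency is L_(fr) = 64/8 + 5 = 13 ms, illustrating how a larger sampling interval amortizes the detection cost across the window.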

The models of the detection latency L_(DNN) and the tracking latency L_(tracker) are respectively described below.

Latency Prediction for Object Detection DNN.

The latency of the object detection DNN (L_(DNN)) is jointly determined by at least two configuration parameters for the multi-branch detector—the input image size shape and the number of proposals nprop. Moreover, considering the input shape of frames may vary in different videos, we add the height and width of the input image as additional features. These features could be ignored if the video source is a video camera (which outputs fixed-sized frames). Besides the input shape of video frames, system contention (CPU/GPU usage and memory bandwidth, as detailed below) will also affect the DNN latency. Thus, the latency equation of the DNN is given by

L_(DNN) = f_(DNN)(nprop, shape, height, width, contention)

We fit a quadratic regression model for f_(DNN) to characterize the latency of the detection DNN. Once trained, the regression model is evaluated on a subset of the test set (sparsely sampled), where the mean squared error (MSE) between the prediction L̂_(DNN) and the ground-truth latency L_(DNN) is reported.
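
For illustration, such a quadratic regression can be fit with off-the-shelf tooling; the following sketch uses scikit-learn, and the profiling data file names and feature ordering are hypothetical. The same approach applies to the tracker latency model f_(tracker) described below.

    import numpy as np
    from sklearn.linear_model import LinearRegression
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import PolynomialFeatures

    # X columns: nprop, shape, height, width, contention; y: measured
    # L_(DNN) in ms. The file names are hypothetical placeholders.
    X = np.load("detector_latency_features.npy")
    y = np.load("detector_latency_labels.npy")

    f_dnn = make_pipeline(PolynomialFeatures(degree=2), LinearRegression())
    f_dnn.fit(X, y)

    # Predicted latency for candidate branches under the sensed contention.
    l_dnn_hat = f_dnn.predict(X[:1])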

Latency Prediction for Object Trackers

The number of objects and the average sizes of objects play a major role in the tracking latency. A model f_(tracker) may characterize the latency of the object tracker under the system contention. Similar to the detection latency model, we also add the height and width of the input image as additional features. Thus, f_(tracker) is given by:

L_(tracker) = f_(tracker)(height, width, n_obj, avg_size, contention)

We fit quadratic regression models to the ground-truth L_(tracker). Moreover, since the model depends on the n_obj and avg_size of the previous frame, we use the previous frame's n_obj and avg_size to train L_(tracker). After the training process, we compute the predicted L̂_(tracker) and measure the MSE across a subset of the test set.

Accuracy Modeling

Accuracy prediction models aim to predict the expectation of the accuracy of each AB for near-future frames. The accuracy of an object detector is usually defined by the metric mean average precision (mAP). However, predicting the absolute mAP given a test video is difficult. To address this issue, the absolute mAP metric may be converted into a relative percentage metric. More precisely, a base branch is identified in the detection framework using the detection-only branch (si=1) with nprop=100 and shape=576. This base branch sets the performance upper bound for all approximation branches (62.3% mAP on the validation set). The mAP of each AB is normalized to its percentage value by dividing its mAP by the base branch's mAP.

Different from the latency models, the factors affecting the accuracy are all coupled together (i.e., there is no distinction between the detection DNN and tracking). Thus, a single unified model may be given by:

A = f_(A)(si, shape, nprop, tracker, ds, movement)

where tracker is the tracker type, ds is the down-sampling ratio of the input to the tracker, and movement is the object movement features extracted from the video content. A decision tree model f_(A) was learned to predict the accuracy A, trained with the MSE loss across the whole training dataset.
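
A sketch of training such a decision tree with scikit-learn follows; the integer encoding of the tracker type and the profiling data file names are assumptions for illustration.

    import numpy as np
    from sklearn.tree import DecisionTreeRegressor

    # X columns: si, shape, nprop, tracker (integer-encoded), ds, movement;
    # y: accuracy normalized to the base branch's mAP. File names are
    # hypothetical placeholders.
    X = np.load("accuracy_features.npy")
    y = np.load("accuracy_labels.npy")

    f_a = DecisionTreeRegressor()  # squared-error (MSE) criterion by default
    f_a.fit(X, y)

    predicted_accuracy = f_a.predict(X[:1])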

Synthetic Contention Generator

A synthetic contention generator is a tool that was developed to study the adaptive object detection framework according to various examples and embodiments described herein. The synthetic contention generator tests how well the adaptive object detection framework can adapt to varying resource contention on the same device on which it is running. It was used to derive experimental results in Xu et al., ApproxDet: Content and Contention-Aware Approximate Object Detection for Mobiles, ACM-SenSys (2020).

The synthetic contention generator (CG) is designed as a stand-in for any resource contention on the device. A detection framework may suffer from unpredictable levels of resource contention when it is running on mobile platforms due to the instantiation of other co-located applications, for which we will not have information. At least three important types of resources are available on mobile platforms—CPU, memory bandwidth (MB), and GPU. CPU contention may be controlled by the number of CPU cores our CG occupies. We control MB contention by the amount of memory-to-cache bandwidth that the CG consumes. The code is modified from the widely used STREAM benchmark, which is meant to measure the MB capacity of a chip. For GPU contention, we control the number of GPU cores that are utilized. The three-dimensional CG is orthogonal, which means we can tune each dimension without affecting the other dimensions. The CG is representative because we executed and mapped the contention caused by some widely used applications in the 3D contention space (Table 2). The first one is an anomaly detection program that uses Robust Random Cut Forest (RRCF) to detect anomalies from local temperature and humidity sensor data. We also used our two object detection DNNs, namely Faster R-CNN and YOLOv3, to check how much contention they can generate.

TABLE 2
Applications running in the 3D contention space

Real Apps            CPU       MB (MB/s)   GPU
Anomaly detection    99.80%    500         0%
Faster R-CNN         69.75%    1000        99%
YOLOv3               65.85%    800         98.50%
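
For illustration, the CPU dimension of such a contention generator can be as simple as pinning busy-loop processes to cores; the following Python sketch shows only that dimension (the MB and GPU dimensions would require STREAM-like memory traffic and GPU kernels, respectively, and are omitted here). This is a minimal sketch, not the actual generator described above.

    import multiprocessing

    def _spin():
        # Busy-loop that keeps one CPU core fully occupied.
        while True:
            pass

    def occupy_cpu_cores(n_cores):
        """Launch n_cores busy-loop processes to generate CPU contention."""
        workers = [multiprocessing.Process(target=_spin, daemon=True)
                   for _ in range(n_cores)]
        for w in workers:
            w.start()
        return workers  # call terminate() on these to release the cores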

Profiling Cost and Sub-Sampling

The cost of collecting ground-truth data with design features for the performance prediction models is significant without proper sampling techniques. We measure our profiling cost for the accuracy, detection latency, and tracker latency models in Table 3.

TABLE 3
Cost of profiling

Task                  Cost
Framework accuracy    2,414 hr · core (20% of the configurations)
Detection latency     7 hr · machine (15 out of 1 million samples)
Tracker latency       1 hr · machine (169 out of 1 million samples)

To efficiently collect the profiling data, we use the master and worker model, where the master node manages a list of configurations of the detection framework and distributes the profiling work, while workers run the particular configurations to collect the training data for the modeling. As the feature space is huge, we sparsely sample the multi-dimensional space of ("number of proposals", "resized shape", "sampling interval", "tracker", "down-sampling ratio of the tracker"). We finally use 20% of the configurations to train our accuracy model.

Similar sub-sampling techniques are used for the latency models as well, and we sample data points on videos of various heights and widths, with various numbers of objects and object sizes, under discrete 3D contention levels. During experimentation, 15 out of a million feature points were used to train our detection latency model and 169 out of a million feature points to train our tracker latency model.

FIG. 3 illustrates example logic for the scheduler 104. The content feature extractor 114 may detect objects in the video information (302). For example, the content feature extractor 114 may manage/generate the content features of the video by extracting the height and width from the current frame, memorizing the n_obj and avg_size of the last frame, and computing the movement from past frames. It is lightweight in terms of the compute load it puts on the target platform, and this is desirable since we have to extract the features at runtime on the target board for feeding into our models.

The contention sensor 116 may sense the contention level of one or more computer resources (304). The contention level is a measure of resource competition on the device between the adaptive object detection framework (or a subcomponent thereof) and background concurrent applications. The higher the contention level, the more the adaptive object detection framework is affected by the concurrent application due to fewer resources being allocated to object tracking, object detection, etc.

There are various manners in which the contention level may be generated. In an example, the contention level may be a measure of CPU, GPU, memory, or other computer resource utilization which is not idle and not dedicated to the adaptive object detection framework. Alternatively or in addition, the contention level may be a ratio or percentage of computer resource utilization between the adaptive object detection framework and other non-idle tasks, applications, processes, etc.

In various experiments, the CPU contention level was an integer between 0 and 6 representing the number of cores that the synthetic contention generator (CG) occupies. For memory bandwidth contention, the scale was an integer between 0 and 40,000 with a unit of MB/s that represents the memory bandwidth that the synthetic CG occupies. For GPU contention, the scale was a floating-point number between 0 and 1 which represents the percentage of GPU cores that the synthetic CG takes.

The contention sensor 116 may capture a utilization metric of a computer resource, or a group of computer resources. The computer resource may include a hardware and/or virtual resource. For example, the resource may include a measure of memory usage, CPU usage, disk storage usage, or a combination thereof. The metric may be expressed as a percentage, a capacity (i.e., a number of bytes), a latency (i.e., milliseconds or the like), or the like. Alternatively, or in addition, the computer resource may include a software-based resource, and the utilization metric may include a measure of operation or execution, such as a number of threads, processes, or other parameters made available through an operating system to measure operating system performance and/or execution.

Although one can theoretically get the ground truth of the resource contention by probing the system and directly measuring the CPU, memory bandwidth, and GPU usage by other processes, it is not practical. As a normal application in the user space, it is difficult to collect the exact resource information from other processes. The hardware also lacks sufficient support for such fine-grained measurement on mobile or embedded devices. In contrast, the offline latency log under various contention levels and the online latency log of the current branch in the past few runs are a natural observation of the contention level. Thus, we propose the log-based contention sensor.

In some examples, the contention sensor 116 may be a log-based contention sensor. The log-based contention sensor may find a contention level where the offline latency log matches the averaged online latency most closely. We use the nearest-neighbor principle to search for such contention levels in our pre-defined orthogonal 3D contention space.

The contention space is the search space of all possible contention levels. To estimate the current contention level, the contention sensor observes the current averaged online latency of the adaptive object detection framework (or sub-components thereof, such as the multi-branch detector). It then checks the offline latency log and estimates the contention level as the one under which the observed online latency is closest to the offline latency.

As multiple contention levels may cause the same impact on the latency of a given AB, we call such a group a cluster of contention levels, and we pick one level out of it as the representative. In comparison to some previous work in the systems community, the contention sensor described herein is lightweight, efficient, and does not require additional privileges at the system level, making it a more practical offering in real-world systems.
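
A sketch of the nearest-neighbor lookup behind the log-based sensor follows, assuming an offline log that maps each (branch, contention level) pair to its profiled latency; the data layout is an assumption for illustration.

    def estimate_contention(offline_log, branch, online_latency_avg):
        """offline_log: {(branch_id, level): profiled_latency_ms}.
        Returns the contention level whose offline latency for the
        current branch is closest to the observed online average."""
        candidates = {level: lat for (b, level), lat in offline_log.items()
                      if b == branch}
        return min(candidates,
                   key=lambda lv: abs(candidates[lv] - online_latency_avg))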

The scheduler may forecast latency metrics for the execution configuration sets (306). The latency metric may measure the end-to-end latency of the object detection for detecting the objects in a video frame, averaged across all the frames of the video, which essentially maps to the entire length of the video. Typically, this will be in milliseconds for latency-sensitive applications, and more specifically in the realm of 33 ms to 50 ms to support 20-30 frames/sec. Alternatively or in addition, the latency metric may be expressed as a percentile, such as a p50, p75, p99, etc.

Each execution configuration set may include a unique combination of tuning parameters and cause the multi-branch object detector to perform a different AB. The scheduler may access the latency model and determine the latency measurement based on the content features extracted from the video information and the contention level obtained from the contention sensor.

The scheduler may forecast accuracy metrics for the execution configuration sets (308). The scheduler may access the accuracy model and generate the accuracy metrics based on the content features. The accuracy metric may measure the mean average precision (mAP) of the bounding box placement relative to the ground-truth bounding boxes; typically, an IoU (intersection-over-union of the output bounding box versus the ground-truth bounding box) value of 50% or higher is considered accurate enough. For mAP, the average IoU of all the bounding boxes across all the video frames is averaged by the total number of bounding boxes. The exact mAP computation may vary somewhat from protocol to protocol, but the output will typically be a percentage, and the higher the percentage the better. The best algorithms will output an mAP of 95% or higher, for example. The mAP will drop for more challenging videos and for a more stringent latency SLA (service level agreement), where a more stringent latency SLA will mean some sacrifice of the accuracy metric, such as by approximating aggressively.
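
For reference, the IoU between a predicted box and a ground-truth box (both in (x1, y1, x2, y2) form) can be computed as in the following sketch.

    def iou(box_a, box_b):
        """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
        ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
        ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
        inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
        area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
        area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
        union = area_a + area_b - inter
        return inter / union if union else 0.0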

The scheduler may select an execution configuration from the domain of execution configurations (310). The accuracy and latency metrics associated with the selected execution configuration may satisfy a performance criterion provided to the scheduler. For example, the scheduler may receive, via user input or some other source, the performance criterion. The performance criterion may have a rule that compares the accuracy and/or latency metrics to predefined threshold values or evaluates the metrics under predefined logic to provide an indication of acceptance, such as a Boolean value or the like. If the criterion is satisfied, then the execution configuration is selected for the multi-branch object detector.

The scheduler may cause object detection, object tracking, or a combination thereof based on the selected execution configuration (312). For example, the scheduler may communicate the execution configuration and/or video information to the multi-branch object detector 102 for processing.

FIG. 4 illustrates example logic for the multi-branch object detector 102. The multi-branch object detector may receive video information (402). The video information may include, among other information, a video frame or multiple frames. The multi-branch object detector may obtain an execution configuration (404). For example, the execution configuration may be received from the scheduler.

The multi-branch object detector 102 may receive a sampling rule. The sampling rule may include or evaluate the sampling interval previously described, or any other information indicative of a rule for switching between the object detector and the object tracker. In some examples, the execution configuration may include a sampling rule.

The multi-branch object detector 102 may select object detection or object tracking depending on the sampling rule (408). In response to selection of object detection, the multi-branch object detector 102 may perform object detection utilizing the detection DNN and the configuration parameters included in the execution configuration (410). In response to selection of object tracking, the multi-branch object detector may select an object tracker (412). For example, the execution configuration may specify the object tracker to select. The multi-branch object detector may perform object tracking according to the parameters in the execution configuration (414).

After the completion of the object tracking and/or object detection, the multi-branch object detector may output the results (416). The results may include the output of the object detector and/or the object tracker. For example, the results may include coordinates of bounding box(es), an object identification in bounding box(es), class probabilities of the objects contained in the bounding boxes, or a combination thereof.

The steps illustrated in the flow logic herein may include additional, different, or fewer operations than illustrated in FIG. 3 and FIG. 4. The steps may be executed in a different order than illustrated. Moreover, the system may be implemented with additional, different, or fewer components than illustrated in FIG. 1 and FIG. 2. Each component may include additional, different, or fewer components.

FIG. 5 illustrates a second example of the system 100. The system 100 may include communication interfaces 812, input interfaces 828, and/or system circuitry 814. The system circuitry 814 may include a processor 816 or multiple processors. Alternatively, or in addition, the system circuitry 814 may include memory 820.

The processor 816 may be in communication with the memory 820. In some examples, the processor 816 may also be in communication with additional elements, such as the communication interfaces 812, the input interfaces 828, and/or the user interface 818. Examples of the processor 816 may include a general processor, a central processing unit, logical CPUs/arrays, a microcontroller, a server, an application specific integrated circuit (ASIC), a digital signal processor, a field programmable gate array (FPGA), and/or a digital circuit, analog circuit, or some combination thereof.

The processor 816 may be one or more devices operable to execute logic. The logic may include computer executable instructions or computer code stored in the memory 820 or in other memory that, when executed by the processor 816, cause the processor 816 to perform the operations of the adaptive object detection framework 101, the multi-branch object detector 102, the scheduler 104, and/or the system 100. The computer code may include instructions executable with the processor 816.

The memory 820 may be any device for storing and retrieving data or any combination thereof. The memory 820 may include non-volatile and/or volatile memory, such as a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM), or flash memory. Alternatively or in addition, the memory 820 may include an optical, magnetic (hard-drive), solid-state drive, or any other form of data storage device. The memory 820 may include at least one of the adaptive object detection framework 101, the multi-branch object detector 102, or the scheduler 104. Alternatively or in addition, the memory may include any other component or sub-component of the system 100 described herein.

The user interface 818 may include any interface for displaying graphical information. The system circuitry 814 and/or the communications interface(s) 812 may communicate signals or commands to the user interface 818 that cause the user interface to display graphical information. Alternatively or in addition, the user interface 818 may be remote to the system 100, and the system circuitry 814 and/or communication interface(s) may communicate instructions, such as HTML, to the user interface to cause the user interface to display, compile, and/or render information content. In some examples, the content displayed by the user interface 818 may be interactive or responsive to user input. For example, the user interface 818 may communicate signals, messages, and/or information back to the communications interface 812 or system circuitry 814.

The system 100 may be implemented in many different ways. In some examples, the system 100 may be implemented with one or more logical components. For example, the logical components of the system 100 may be hardware or a combination of hardware and software. The logical components may include the adaptive object detection framework 101, the multi-branch object detector 102, the scheduler 104, and/or any component or subcomponent of the system 100. In some examples, each logic component may include an application specific integrated circuit (ASIC), a Field Programmable Gate Array (FPGA), a digital logic circuit, an analog circuit, a combination of discrete circuits, gates, or any other type of hardware or combination thereof. Alternatively or in addition, each component may include memory hardware, such as a portion of the memory 820, for example, that comprises instructions executable with the processor 816 or other processor to implement one or more of the features of the logical components. When any one of the logical components includes the portion of the memory that comprises instructions executable with the processor 816, the component may or may not include the processor 816. In some examples, each logical component may just be the portion of the memory 820 or other physical memory that comprises instructions executable with the processor 816, or other processor(s), to implement the features of the corresponding component without the component including any other hardware. Because each component includes at least some hardware even when the included hardware comprises software, each component may be interchangeably referred to as a hardware component.

Some features are shown stored in a computer readable storage medium (for example, as logic implemented as computer executable instructions or as data structures in memory). All or part of the system and its logic and data structures may be stored on, distributed across, or read from one or more types of computer readable storage media. Examples of the computer readable storage medium may include a hard disk, a flash drive, a cache, volatile memory, non-volatile memory, RAM, flash memory, or any other type of computer readable storage medium or storage media. The computer readable storage medium may include any type of non-transitory computer readable medium, such as a CD-ROM, a volatile memory, a non-volatile memory, ROM, RAM, or any other suitable storage device.

The processing capability of the system may be distributed among multiple entities, such as among multiple processors and memories, optionally including multiple distributed processing systems. Parameters, databases, and other data structures may be separately stored and managed, may be incorporated into a single memory or database, may be logically and physically organized in many different ways, and may be implemented with different types of data structures such as linked lists, hash tables, or implicit storage mechanisms. Logic, such as programs or circuitry, may be combined or split among multiple programs, distributed across several memories and processors, and may be implemented in a library, such as a shared library (for example, a dynamic link library (DLL)).

All of the discussion, regardless of the particular implementation described, is illustrative in nature, rather than limiting. For example, although selected aspects, features, or components of the implementations are depicted as being stored in memory(s), all or part of the system or systems may be stored on, distributed across, or read from other computer readable storage media, for example, secondary storage devices such as hard disks and flash memory drives. Moreover, the various logical units, circuitry, and screen display functionality are but one example of such functionality, and any other configurations encompassing similar functionality are possible.

The respective logic, software, or instructions for implementing the processes, methods, and/or techniques discussed above may be provided on computer readable storage media. The functions, acts, or tasks illustrated in the figures or described herein may be executed in response to one or more sets of logic or instructions stored in or on computer readable media. The functions, acts, or tasks are independent of the particular type of instruction set, storage media, processor, or processing strategy and may be performed by software, hardware, integrated circuits, firmware, micro code, and the like, operating alone or in combination. Likewise, processing strategies may include multiprocessing, multitasking, parallel processing, and the like. In one example, the instructions are stored on a removable media device for reading by local or remote systems. In other examples, the logic or instructions are stored in a remote location for transfer through a computer network or over telephone lines. In yet other examples, the logic or instructions are stored within a given computer and/or central processing unit ("CPU").

Furthermore, although specific components are described above, methods, systems, and articles of manufacture described herein may include additional, fewer, or different components. For example, a processor may be implemented as a microprocessor, microcontroller, application specific integrated circuit (ASIC), discrete logic, or a combination of other types of circuits or logic. Similarly, memories may be DRAM, SRAM, Flash, or any other type of memory. Flags, data, databases, tables, entities, and other data structures may be separately stored and managed, may be incorporated into a single memory or database, may be distributed, or may be logically and physically organized in many different ways. The components may operate independently or be part of a same apparatus executing a same program or different programs. The components may be resident on separate hardware, such as separate removable circuit boards, or share common hardware, such as a same memory and processor for implementing instructions from the memory. Programs may be parts of a single program, separate programs, or distributed across several memories and processors.

A second action may be said to be "in response to" a first action independent of whether the second action results directly or indirectly from the first action. The second action may occur at a substantially later time than the first action and still be in response to the first action. Similarly, the second action may be said to be in response to the first action even if intervening actions take place between the first action and the second action, and even if one or more of the intervening actions directly cause the second action to be performed. For example, a second action may be in response to a first action if the first action sets a flag and a third action later initiates the second action whenever the flag is set.

To clarify the use of and to hereby provide notice to the public, the phrases "at least one of <A>, <B>, . . . and <N>" or "at least one of <A>, <B>, <N>, or combinations thereof" or "<A>, <B>, . . . and/or <N>" are defined by the Applicant in the broadest sense, superseding any other implied definitions hereinbefore or hereinafter unless expressly asserted by the Applicant to the contrary, to mean one or more elements selected from the group comprising A, B, . . . and N. In other words, the phrases mean any combination of one or more of the elements A, B, . . . or N including any one element alone or the one element in combination with one or more of the other elements which may also include, in combination, additional elements not listed.

While various embodiments have been described, it will be apparent to those of ordinary skill in the art that many more embodiments and implementations are possible. Accordingly, the embodiments described herein are examples, not the only possible embodiments and implementations.

What is claimed is:
 1. A method, comprising: receiving video information; performing object detection and object tracking based on an execution configuration; approximating an optimized execution configuration based on computer resource contention and content of the video information by: identifying, based on the video information, a plurality of content features; measuring a contention level of a computer resource; forecasting, based on the content features and the measured contention level, latency metrics for a plurality of execution configuration sets, respectively; forecasting, based on the content features, accuracy metrics for the execution configuration sets, respectively; and selecting an optimized execution configuration set from the execution configuration sets in response to satisfaction of a performance criterion; and performing object detection and object tracking based on the optimized execution configuration.
 2. The method of claim 1, wherein the optimized execution configuration comprises a sampling rule, wherein performing object detection and object tracking based on the optimized execution configuration further comprises: executing either object detection or object tracking based on the sampling rule.
 3. The method of claim 2, wherein the sampling rule includes a sampling interval which specifies how often to perform object detection, object tracking, or a combination thereof.
 4. The method of claim 1, wherein the optimized execution configuration comprises an input shape and a number of proposals, wherein performing object detection and object tracking based on the optimized execution configuration further comprises: accessing the input shape and the number of proposals from the optimized execution configuration set; providing the input shape, the number of proposals, and the video information to a deep neural network; and determining an object classification based on output of the deep neural network.
 5. The method of claim 1, wherein the optimized execution configuration comprises a tracker type parameter, wherein performing object detection and object tracking based on the optimized execution configuration further comprises: accessing the tracker type parameter from the optimized execution configuration; selecting, from a plurality of available object trackers, a second object tracker corresponding to the tracker type; and performing object tracking with the second object tracker.
 6. The method of claim 1, wherein the optimized execution configuration comprises a down-sampling ratio parameter, wherein performing object detection and object tracking based on the optimized execution configuration further comprises: accessing the down-sampling ratio parameter from the optimized execution configuration; and down-sampling the video information based on the down-sampling ratio.
 7. The method of claim 1, wherein forecasting, based on the content features and the measured contention level, latency metrics for a plurality of execution configuration sets further comprises: providing the content features and the contention level to a machine learning model trained based on training information comprising execution configuration sets, historical content features, historical contention levels, and historical latency metrics.
 8. The method of claim 1, wherein forecasting, based on the content features, accuracy metrics for the execution configuration sets, respectively, further comprises: providing the content features to a machine learning model trained based on training information comprising historical content features and historical accuracy metrics.
 9. The method of claim 1, wherein obtaining the contention level further comprises measuring usage of physical or virtualized hardware that is accessed by performance of the object detection and object tracking.
 10. A system comprising a processor, the processor configured to: receive video information; perform object detection and object tracking based on an execution configuration; approximate an optimized execution configuration based on computer resource contention and content of the video information, wherein to approximate the optimized execution configuration, the processor is configured to: identify, based on the video information, a plurality of content features; measure a contention level of a computer resource; approximate, based on the content features and the contention level, latency metrics for a plurality of execution configuration sets, respectively; approximate, based on the content features, accuracy metrics for the execution configuration sets, respectively; and select the optimized execution configuration set from the execution configuration sets in response to satisfaction of a performance criterion; and perform object detection and object tracking based on the optimized execution configuration set.
 11. The system of claim 10, wherein the optimized execution configuration comprises a sampling rule, wherein to perform object detection and object tracking based on the optimized execution configuration, the processor is further configured to: execute either object detection or object tracking based on the sampling rule.
 12. The system of claim 11, wherein the sampling rule includes a sampling interval which specifies how often to perform object detection, object tracking, or a combination thereof.
 13. The system of claim 10, wherein the optimized execution configuration comprises an input shape and a number of proposals, wherein to perform object detection and object tracking based on the optimized execution configuration, the processor is further configured to: access the input shape and the number of proposals from the optimized execution configuration set; provide the input shape, the number of proposals, and the video information to a deep neural network; and determine an object classification based on output of the deep neural network.
 14. The system of claim 10, wherein the optimized execution configuration comprises a tracker type parameter, wherein to perform object detection and object tracking based on the optimized execution configuration, the processor is further configured to: access the tracker type parameter from the optimized execution configuration; select, from a plurality of available object trackers, a second object tracker corresponding to the tracker type; and perform object tracking with the second object tracker.
 15. The system of claim 10, wherein the optimized execution configuration comprises a down-sampling ratio parameter, wherein to perform object detection and object tracking based on the optimized execution configuration, the processor is further configured to: access the down-sampling ratio parameter from the optimized execution configuration; and down-sample the video information based on the down-sampling ratio.
 16. The system of claim 10, wherein to approximate, based on the content features and the measured contention level, latency metrics for a plurality of execution configuration sets, the processor is further configured to: provide the content features and the measured contention level to a machine learning model trained based on training information comprising execution configuration sets, historical content features, historical contention levels, and historical latency metrics.
 17. The system of claim 10, wherein to approximate, based on the content features, accuracy metrics for the execution configuration sets, respectively, the processor is further configured to: provide the content features to a machine learning model trained based on training information comprising historical content features and historical accuracy metrics.
 18. The system of claim 10, wherein to obtain the contention level, the processor is further configured to: measure usage of physical or virtualized hardware that is accessed by performance of the object detection and object tracking.